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Abstract 

We  propose  a  new  phrase-based  translation 
model  and  decoding  algorithm  that  enables 
us  to  evaluate  and  compare  several,  previ¬ 
ously  proposed  phrase-based  translation  mod¬ 
els.  Within  our  framework,  we  carry  out  a 
large  number  of  experiments  to  understand  bet¬ 
ter  and  explain  why  phrase-based  models  out¬ 
perform  word-based  models.  Our  empirical  re¬ 
sults,  which  hold  for  all  examined  language 
pairs,  suggest  that  the  highest  levels  of  perfor¬ 
mance  can  be  obtained  through  relatively  sim¬ 
ple  means:  heuristic  learning  of  phrase  trans¬ 
lations  from  word-based  alignments  and  lexi¬ 
cal  weighting  of  phrase  translations.  Surpris¬ 
ingly,  learning  phrases  longer  than  three  words 
and  learning  phrases  from  high-accuracy  word- 
level  alignment  models  does  not  have  a  strong 
impact  on  performance.  Learning  only  syntac¬ 
tically  motivated  phrases  degrades  the  perfor¬ 
mance  of  our  systems. 

1  Introduction 

Various  researchers  have  improved  the  quality  of  statis¬ 
tical  machine  translation  system  with  the  use  of  phrase 
translation.  Och  et  al.  [1999]’s  alignment  template  model 
can  be  reframed  as  a  phrase  translation  system;  Yamada 
and  Knight  [2001]  use  phrase  translation  in  a  syntax- 
based  translation  system;  Marcu  and  Wong  [2002]  in¬ 
troduced  a  joint-probability  model  for  phrase  translation; 
and  the  CMU  and  IBM  word-based  statistical  machine 
translation  systems1  are  augmented  with  phrase  transla¬ 
tion  capability. 

Phrase  translation  clearly  helps,  as  we  will  also  show 
with  the  experiments  in  this  paper.  But  what  is  the  best 
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method  to  extract  phrase  translation  pairs?  In  order  to 
investigate  this  question,  we  created  a  uniform  evaluation 
framework  that  enables  the  comparison  of  different  ways 
to  build  a  phrase  translation  table. 

Our  experiments  show  that  high  levels  of  performance 
can  be  achieved  with  fairly  simple  means.  In  fact, 
for  most  of  the  steps  necessary  to  build  a  phrase-based 
system,  tools  and  resources  are  freely  available  for  re¬ 
searchers  in  the  field.  More  sophisticated  approaches  that 
make  use  of  syntax  do  not  lead  to  better  performance.  In 
fact,  imposing  syntactic  restrictions  on  phrases,  as  used  in 
recently  proposed  syntax-based  translation  models  [Ya¬ 
mada  and  Knight,  2001],  proves  to  be  harmful.  Our  ex¬ 
periments  also  show,  that  small  phrases  of  up  to  three 
words  are  sufficient  for  obtaining  high  levels  of  accuracy. 

Performance  differs  widely  depending  on  the  methods 
used  to  build  the  phrase  translation  table.  We  found  ex¬ 
traction  heuristics  based  on  word  alignments  to  be  better 
than  a  more  principled  phrase-based  alignment  method. 
However,  what  constitutes  the  best  heuristic  differs  from 
language  pair  to  language  pair  and  varies  with  the  size  of 
the  training  corpus. 

2  Evaluation  Framework 

In  order  to  compare  different  phrase  extraction  methods, 
we  designed  a  uniform  framework.  We  present  a  phrase 
translation  model  and  decoder  that  works  with  any  phrase 
translation  table. 

2.1  Model 

The  phrase  translation  model  is  based  on  the  noisy  chan¬ 
nel  model.  We  use  Bayes  rule  to  reformulate  the  transla¬ 
tion  probability  for  translating  a  foreign  sentence  f  into 
English  e  as 

argmaxep(e|f)  =  argmaxep(f  |e)p(e) 

This  allows  for  a  language  model  p(e)  and  a  separate 
translation  model  p(f  |e). 
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During  decoding,  the  foreign  input  sentence  f  is  seg¬ 
mented  into  a  sequence  of  I  phrases  //.  We  assume  a 
uniform  probability  distribution  over  all  possible  segmen¬ 
tations. 

Each  foreign  phrase  /,  in  //  is  translated  into  an  En¬ 
glish  phrase  e j.  The  English  phrases  may  be  reordered. 
Phrase  translation  is  modeled  by  a  probability  distribution 
0{fi\ei).  Recall  that  due  to  the  Bayes  rule,  the  translation 
direction  is  inverted  from  a  modeling  standpoint. 

Reordering  of  the  English  output  phrases  is  modeled 
by  a  relative  distortion  probability  distribution  d(a*  — 
&i-i),  where  a,  denotes  the  start  position  of  the  foreign 
phrase  that  was  translated  into  the  *th  English  phrase,  and 
i  denotes  the  end  position  of  the  foreign  phrase  trans¬ 
lated  into  the  ( i  —  l)th  English  phrase. 

In  all  our  experiments,  the  distortion  probability  distri¬ 
bution  d(- )  is  trained  using  a  joint  probability  model  (see 
Section  3.3).  Alternatively,  we  could  also  use  a  simpler 
distortion  model  d(oq  —  frj_i)  =  a)0;-6*-!-1!  with  an 
appropriate  value  for  the  parameter  a. 

In  order  to  calibrate  the  output  length,  we  introduce  a 
factor  u)  for  each  generated  English  word  in  addition  to 
the  trigram  language  model  ppM-  This  is  a  simple  means 
to  optimize  performance.  Usually,  this  factor  is  larger 
than  1,  biasing  longer  output. 

In  summary,  the  best  English  output  sentence  epest 
given  a  foreign  input  sentence  f  according  to  our  model 
is 

ebest  =  argmaxep(e|f) 

=  argmaxep(f|e)pLM(e)wlenSth(e) 

where  p(f\e)  is  decomposed  into 
/ 

p(R \e[)  =  Y[&{fi\ei)d(ai  -  i) 

i= 1 

For  all  our  experiments  we  use  the  same  training  data, 
trigram  language  model  [Seymore  and  Rosenfeld,  1997], 
and  a  specialized  decoder. 

2.2  Decoder 

The  phrase-based  decoder  we  developed  for  purpose  of 
comparing  different  phrase-based  translation  models  em¬ 
ploys  a  beam  search  algorithm,  similar  to  the  one  by  Je- 
linek  [1998].  The  English  output  sentence  is  generated 
left  to  right  in  form  of  partial  translations  (or  hypothe¬ 
ses). 

We  start  with  an  initial  empty  hypothesis.  A  new  hy¬ 
pothesis  is  expanded  from  an  existing  hypothesis  by  the 
translation  of  a  phrase  as  follows:  A  sequence  of  un¬ 
translated  foreign  words  and  a  possible  English  phrase 
translation  for  them  is  selected.  The  English  phrase  is  at¬ 
tached  to  the  existing  English  output  sequence.  The  for¬ 
eign  words  are  marked  as  translated  and  the  probability 
cost  of  the  hypothesis  is  updated. 


The  cheapest  (highest  probability)  final  hypothesis 
with  no  untranslated  foreign  words  is  the  output  of  the 
search. 

The  hypotheses  are  stored  in  stacks.  The  stack  sm 
contains  all  hypotheses  in  which  m  foreign  words  have 
been  translated.  We  recombine  search  hypotheses  as  done 
by  Och  et  al.  [2001].  While  this  reduces  the  number  of 
hypotheses  stored  in  each  stack  somewhat,  stack  size  is 
exponential  with  respect  to  input  sentence  length.  This 
makes  an  exhaustive  search  impractical. 

Thus,  we  prune  out  weak  hypotheses  based  on  the  cost 
they  incurred  so  far  and  a  future  cost  estimate.  For  each 
stack,  we  only  keep  a  beam  of  the  best  n  hypotheses. 
Since  the  future  cost  estimate  is  not  perfect,  this  leads  to 
search  errors.  Our  future  cost  estimate  takes  into  account 
the  estimated  phrase  translation  cost,  but  not  the  expected 
distortion  cost. 

We  compute  this  estimate  as  follows:  For  each  possi¬ 
ble  phrase  translation  anywhere  in  the  sentence  (we  call 
it  a  translation  option ),  we  multiply  its  phrase  translation 
probability  with  the  language  model  probability  for  the 
generated  English  phrase.  As  language  model  probabil¬ 
ity  we  use  the  unigram  probability  for  the  first  word,  the 
bigram  probability  for  the  second,  and  the  trigram  proba¬ 
bility  for  all  following  words. 

Given  the  costs  for  the  translation  options,  we  can  com¬ 
pute  the  estimated  future  cost  for  any  sequence  of  con¬ 
secutive  foreign  words  by  dynamic  programming.  Note 
that  this  is  only  possible,  since  we  ignore  distortion  costs. 
Since  there  are  only  n(n  +  l)/2  such  sequences  for  a 
foreign  input  sentence  of  length  n,  we  can  pre-compute 
these  cost  estimates  beforehand  and  store  them  in  a  table. 

During  translation,  future  costs  for  uncovered  foreign 
words  can  be  quickly  computed  by  consulting  this  table. 
If  a  hypothesis  has  broken  sequences  of  untranslated  for¬ 
eign  words,  we  look  up  the  cost  for  each  sequence  and 
take  the  product  of  their  costs. 

The  beam  size,  e.g.  the  maximum  number  of  hypothe¬ 
ses  in  each  stack,  is  fixed  to  a  certain  number.  The 
number  of  translation  options  is  linear  with  the  sentence 
length.  Hence,  the  time  complexity  of  the  beam  search  is 
quadratic  with  sentence  length,  and  linear  with  the  beam 
size. 

Since  the  beam  size  limits  the  search  space  and  there¬ 
fore  search  quality,  we  have  to  find  the  proper  trade-off 
between  speed  (low  beam  size)  and  performance  (high 
beam  size).  For  our  experiments,  a  beam  size  of  only 
100  proved  to  be  sufficient.  With  larger  beams  sizes, 
only  few  sentences  are  translated  differently.  With  our 
decoder,  translating  1755  sentence  of  length  5-15  words 
takes  about  10  minutes  on  a  2  GHz  Linux  system.  In 
other  words,  we  achieved  fast  decoding,  while  ensuring 
high  quality. 


3  Methods  for  Learning  Phrase 
Translation 


phrases  to  syntactically  motivated  phrases  could  filter  out 
such  non-intuitive  pairs. 


We  carried  out  experiments  to  compare  the  performance 
of  three  different  methods  to  build  phrase  translation 
probability  tables.  We  also  investigate  a  number  of  varia¬ 
tions.  We  report  most  experimental  results  on  a  German- 
English  translation  task,  since  we  had  sufficient  resources 
available  for  this  language  pair.  We  confirm  the  major 
points  in  experiments  on  additional  language  pairs. 

As  the  first  method,  we  learn  phrase  alignments  from 
a  corpus  that  has  been  word-aligned  by  a  training  toolkit 
for  a  word-based  translation  model:  the  Giza++  [Och  and 
Ney,  2000]  toolkit  for  the  IBM  models  [Brown  et  ah, 
1993].  The  extraction  heuristic  is  similar  to  the  one  used 
in  the  alignment  template  work  by  Och  et  al.  [1999], 

A  number  of  researchers  have  proposed  to  focus  on 
the  translation  of  phrases  that  have  a  linguistic  motiva¬ 
tion  [Yamada  and  Knight,  2001;  Imamura,  2002],  They 
only  consider  word  sequences  as  phrases,  if  they  are  con¬ 
stituents,  i.e.  subtrees  in  a  syntax  tree  (such  as  a  noun 
phrase).  To  identify  these,  we  use  a  word-aligned  corpus 
annotated  with  parse  trees  generated  by  statistical  syntac¬ 
tic  parsers  [Collins,  1997;  Schmidt  and  Schulte  im  Walde, 
2000], 

The  third  method  for  comparison  is  the  joint  phrase 
model  proposed  by  Marcu  and  Wong  [2002].  This  model 
learns  directly  a  phrase-level  alignment  of  the  parallel 
corpus. 

3.1  Phrases  from  Word-Based  Alignments 

The  Giza++  toolkit  was  developed  to  train  word-based 
translation  models  from  parallel  corpora.  As  a  by¬ 
product,  it  generates  word  alignments  for  this  data.  We 
improve  this  alignment  with  a  number  of  heuristics, 
which  are  described  in  more  detail  in  Section  4.5. 

We  collect  all  aligned  phrase  pairs  that  are  consistent 
with  the  word  alignment:  The  words  in  a  legal  phrase  pair 
are  only  aligned  to  each  other,  and  not  to  words  outside 
[Och  et  al.,  1999], 

Given  the  collected  phrase  pairs,  we  estimate  the 
phrase  translation  probability  distribution  by  relative  fre¬ 
quency: 

count(/,  e) 

L ;  count(/.  e) 

No  smoothing  is  performed. 

3.2  Syntactic  Phrases 

If  we  collect  all  phrase  pairs  that  are  consistent  with  word 
alignments,  this  includes  many  non-intuitive  phrases.  For 
instance,  translations  for  phrases  such  as  “house  the” 
may  be  learned.  Intuitively  we  would  be  inclined  to  be¬ 
lieve  that  such  phrases  do  not  help:  Restricting  possible 


<t>(f  |e) 


Another  motivation  to  evaluate  the  performance  of 
a  phrase  translation  model  that  contains  only  syntactic 
phrases  comes  from  recent  efforts  to  built  syntactic  trans¬ 
lation  models  [Yamada  and  Knight,  2001;  Wu,  1997].  In 
these  models,  reordering  of  words  is  restricted  to  reorder¬ 
ing  of  constituents  in  well-formed  syntactic  parse  trees. 
When  augmenting  such  models  with  phrase  translations, 
typically  only  translation  of  phrases  that  span  entire  syn¬ 
tactic  subtrees  is  possible.  It  is  important  to  know  if  this 
is  a  helpful  or  harmful  restriction. 

Consistent  with  Imamura  [2002],  we  define  a  syntac¬ 
tic  phrase  as  a  word  sequence  that  is  covered  by  a  single 
subtree  in  a  syntactic  parse  tree. 

We  collect  syntactic  phrase  pairs  as  follows:  We  word- 
align  a  parallel  corpus,  as  described  in  Section  3.1.  We 
then  parse  both  sides  of  the  corpus  with  syntactic  parsers 
[Collins,  1997;  Schmidt  and  Schulte  im  Walde,  2000], 
For  all  phrase  pairs  that  are  consistent  with  the  word 
alignment,  we  additionally  check  if  both  phrases  are  sub¬ 
trees  in  the  parse  trees.  Only  these  phrases  are  included 
in  the  model. 

Hence,  the  syntactically  motivated  phrase  pairs  learned 
are  a  subset  of  the  phrase  pairs  learned  without  knowl¬ 
edge  of  syntax  (Section  3.1). 

As  in  Section  3.1,  the  phrase  translation  probability 
distribution  is  estimated  by  relative  frequency. 


3.3  Phrases  from  Phrase  Alignments 

Marcu  and  Wong  [2002]  proposed  a  translation  model 
that  assumes  that  lexical  correspondences  can  be  estab¬ 
lished  not  only  at  the  word  level,  but  at  the  phrase  level 
as  well.  To  learn  such  correspondences,  they  introduced  a 
phrase-based  joint  probability  model  that  simultaneously 
generates  both  the  Source  and  Target  sentences  in  a  paral¬ 
lel  corpus.  Expectation  Maximization  learning  in  Marcu 
and  Wong’s  framework  yields  both  (i)  a  joint  probabil¬ 
ity  distribution  ^(e,  /),  which  reflects  the  probability  that 
phrases  e  and  /  are  translation  equivalents;  (ii)  and  a  joint 
distribution  d(i,j ),  which  reflects  the  probability  that  a 
phrase  at  position  i  is  translated  into  a  phrase  at  position 
j.  To  use  this  model  in  the  context  of  our  framework,  we 
simply  marginalize  to  conditional  probabilities  the  joint 
probabilities  estimated  by  Marcu  and  Wong  [2002].  Note 
that  this  approach  is  consistent  with  the  approach  taken 
by  Marcu  and  Wong  themselves,  who  use  conditional 
models  during  decoding. 


Method 

Training  corpus  size  i 

10k 

20k 

40k 

80k 

160k 

320k 

AP 

84k 

176k 

370k 

736k 

1536k 

3152k 

Joint 

125k 

220k 

400k 

707k 

1254k 

2214k 

Syn 

19k 

24k 

67k 

105k 

217k 

373k 

Table  1:  Size  of  the  phrase  translation  table  in  terms  of 
distinct  phrase  pairs  (maximum  phrase  length  4) 


4  Experiments 

We  used  the  freely  available  Europarl  corpus  2  to  carry 
out  experiments.  This  corpus  contains  over  20  million 
words  in  each  of  the  eleven  official  languages  of  the  Eu¬ 
ropean  Union,  covering  the  proceedings  of  the  European 
Parliament  1996-2001.  1755  sentences  of  length  5-15 
were  reserved  for  testing. 

In  all  experiments  in  Section  4. 1-4.6  we  translate  from 
German  to  English.  We  measure  performance  using  the 
BLEU  score  [Papineni  et  al.,  2001],  which  estimates  the 
accuracy  of  translation  output  with  respect  to  a  reference 
translation. 

4.1  Comparison  of  Core  Methods 

First,  we  compared  the  performance  of  the  three  methods 
for  phrase  extraction  head-on,  using  the  same  decoder 
(Section  2)  and  the  same  trigram  language  model.  Fig¬ 
ure  1  displays  the  results. 

In  direct  comparison,  learning  all  phrases  consistent 
with  the  word  alignment  (AP)  is  superior  to  the  joint 
model  (Joint),  although  not  by  much.  The  restriction  to 
only  syntactic  phrases  (Syn)  is  harmful.  We  also  included 
in  the  figure  the  performance  of  an  IBM  Model  4  word- 
based  translation  system  (M4),  which  uses  a  greedy  de¬ 
coder  [Germann  et  al.,  2001].  Its  performance  is  worse 
than  both  AP  and  Joint.  These  results  are  consistent 
over  training  corpus  sizes  from  10,000  sentence  pairs  to 
320,000  sentence  pairs.  All  systems  improve  with  more 
data. 

Table  1  lists  the  number  of  distinct  phrase  translation 
pairs  learned  by  each  method  and  each  corpus.  The  num¬ 
ber  grows  almost  linearly  with  the  training  corpus  size, 
due  to  the  large  number  of  singletons.  The  syntactic  re¬ 
striction  eliminates  over  80%  of  all  phrase  pairs. 

Note  that  the  millions  of  phrase  pairs  learned  fit  easily 
into  the  working  memory  of  modern  computers.  Even  the 
largest  models  take  up  only  a  few  hundred  megabyte  of 
RAM. 

4.2  Weighting  Syntactic  Phrases 

The  restriction  on  syntactic  phrases  is  harmful,  because 
too  many  phrases  are  eliminated.  But  still,  we  might  sus¬ 
pect,  that  these  lead  to  more  reliable  phrase  pairs. 

2The  Europarl  corpus  is  available  at 

http : / /www . isi . edu/ ~koehn/ europarl/ 


Training  Corpus  Size 

Figure  1:  Comparison  of  the  core  methods:  all  phrase 
pairs  consistent  with  a  word  alignment  (AP),  phrase  pairs 
from  the  joint  model  (Joint),  IBM  Model  4  (M4),  and 
only  syntactic  phrases  (Syn) 


One  way  to  check  this  is  to  use  all  phrase  pairs  and  give 
more  weight  to  syntactic  phrase  translations.  This  can  be 
done  either  during  the  data  collection  -  say,  by  counting 
syntactic  phrase  pairs  twice  -  or  during  translation  -  each 
time  the  decoder  uses  a  syntactic  phrase  pair,  it  credits  a 
bonus  factor  to  the  hypothesis  score. 

We  found  that  neither  of  these  methods  result  in  signif¬ 
icant  improvement  of  translation  performance.  Even  pe¬ 
nalizing  the  use  of  syntactic  phrase  pairs  does  not  harm 
performance  significantly.  These  results  suggest  that  re¬ 
quiring  phrases  to  be  syntactically  motivated  does  not 
lead  to  better  phrase  pairs,  but  only  to  fewer  phrase  pairs, 
with  the  loss  of  a  good  amount  of  valuable  knowledge. 

One  illustration  for  this  is  the  common  German  “es 
gibt”,  which  literally  translates  as  “it  gives”,  but  really 
means  “there  is”.  “Es  gibt”  and  “there  is”  are  not  syn¬ 
tactic  constituents.  Note  that  also  constructions  such  as 
“with  regard  to”  and  “note  that”  have  fairly  complex  syn¬ 
tactic  representations,  but  often  simple  one  word  trans¬ 
lations.  Allowing  to  learn  phrase  translations  over  such 
sentence  fragments  is  important  for  achieving  high  per¬ 
formance. 

4.3  Maximum  Phrase  Length 

How  long  do  phrases  have  to  be  to  achieve  high  perfor¬ 
mance?  Figure  2  displays  results  from  experiments  with 
different  maximum  phrase  lengths.  All  phrases  consis¬ 
tent  with  the  word  alignment  (AP)  are  used.  Surprisingly, 
limiting  the  length  to  a  maximum  of  only  three  words 
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Figure  2:  Different  limits  for  maximum  phrase  length 
show  that  length  3  is  enough 


fl  f 2  f 3 
NULL  —  —  ## 

el  ## - 

e2  —  ##  — 
e3  —  ##  — 

Pw{f\e,a)  —  p™(/i/2/3|eie2e3,a) 

=  «'(/i|ei) 

x^(w(/2|e2)  +  w(/2|e3)) 
x  w(/3|NULL) 

Figure  3:  Lexical  weight  pw  of  a  phrase  pair  (/,  e)  given 
an  alignment  a  and  a  lexical  translation  probability  distri¬ 
bution  w(-) 


Max. 

Length 

j  Training  corpus  size  j 

10k 

20k 

40k 

80k 

160k 

320k 

2 

37k 

70k 

135k 

250k 

474k 

882k 

3 

63k 

128k 

261k 

509k 

1028k 

1996k 

4 

84k 

176k 

370k 

736k 

1536k 

3152k 

5 

101k 

215k 

459k 

925k 

1968k 

4119k 

7 

130k 

278k 

605k 

1217k 

2657k 

5663k 

Table  2:  Size  of  the  phrase  translation  table  with  varying 


maximum  phrase  length  limits 


per  phrase  already  achieves  top  performance.  Learning 
longer  phrases  does  not  yield  much  improvement,  and 
occasionally  leads  to  worse  results.  Reducing  the  limit 
to  only  two,  however,  is  clearly  detrimental. 

Allowing  for  longer  phrases  increases  the  phrase  trans¬ 
lation  table  size  (see  Table  2).  The  increase  is  almost  lin¬ 
ear  with  the  maximum  length  limit.  Still,  none  of  these 
model  sizes  cause  memory  problems. 


4.4  Lexical  Weighting 

One  way  to  validate  the  quality  of  a  phrase  translation 
pair  is  to  check,  how  well  its  words  translate  to  each  other. 
For  this,  we  need  a  lexical  translation  probability  distribu¬ 
tion  w(f  \e).  We  estimated  it  by  relative  frequency  from 
the  same  word  alignments  as  the  phrase  model. 

count  (/,  e) 

E/'  count(/',e) 


A  special  English  NULL  token  is  added  to  each  En¬ 
glish  sentence  and  aligned  to  each  unaligned  foreign 
word. 

Given  a  phrase  pair  /,  e  and  a  word  alignment  a  be¬ 
tween  the  foreign  word  positions  i  =  1. ....  n  and  the 
English  word  positions  j  =  0, 1, ...,  to,  we  compute  the 
lexical  weight  pw  by 

n  1 

=  n  KTicOTiTTi  - 

i=l  IUIV  JIV(i,j)ea 


See  Figure  3  for  an  example. 

If  there  are  multiple  alignments  a  for  a  phrase  pair 
(/,e),  we  use  the  one  with  the  highest  lexical  weight: 

Pw  (/ \e)  =  ma \apw  (/ |e,  a) 

We  use  the  lexical  weight  pw  during  translation  as  a 
additional  factor.  This  means  that  the  model  p(f  |e)  is 
extended  to 

i 

p(f{ \e[)  =  Y[0(fi\ei)d(ai  -  bi-i)pw(fi\ei,a)x 
i=  1 

The  parameter  A  defines  the  strength  of  the  lexical 
weight  pw.  Good  values  for  this  parameter  are  around 
0.25. 

Figure  4  shows  the  impact  of  lexical  weighting  on  ma¬ 
chine  translation  performance.  In  our  experiments,  we 
achieved  improvements  of  up  to  0.01  on  the  BLEU  score 
scale.  Again,  all  phrases  consistent  with  the  word  align¬ 
ment  are  used  (Section  3.1). 

Note  that  phrase  translation  with  a  lexical  weight  is  a 
special  case  of  the  alignment  template  model  [Och  et  al., 
1999]  with  one  word  class  for  each  word.  Our  simplifica¬ 
tion  has  the  advantage  that  the  lexical  weights  can  be  fac¬ 
tored  into  the  phrase  translation  table  beforehand,  speed¬ 
ing  up  decoding.  In  contrast  to  the  beam  search  decoder 
for  the  alignment  template  model,  our  decoder  is  able  to 
search  all  possible  phrase  segmentations  of  the  input  sen¬ 
tence,  instead  of  choosing  one  segmentation  before  de¬ 
coding. 

4.5  Phrase  Extraction  Heuristic 

Recall  from  Section  3.1  that  we  learn  phrase  pairs  from 
word  alignments  generated  by  Giza++.  The  IBM  Models 
that  this  toolkit  implements  only  allow  at  most  one  En¬ 
glish  word  to  be  aligned  with  a  foreign  word.  We  remedy 
this  problem  with  a  heuristic  approach. 


Training  Corpus  Size 

Figure  4:  Lexical  weighting  (lex)  improves  performance 


First,  we  align  a  parallel  corpus  bidirectionally  -  for¬ 
eign  to  English  and  English  to  foreign.  This  gives  us  two 
word  alignments  that  we  try  to  reconcile.  If  we  intersect 
the  two  alignments,  we  get  a  high-precision  alignment  of 
high-confidence  alignment  points.  If  we  take  the  union  of 
the  two  alignments,  we  get  a  high-recall  alignment  with 
additional  alignment  points. 

We  explore  the  space  between  intersection  and  union 
with  expansion  heuristics  that  start  with  the  intersection 
and  add  additional  alignment  points.  The  decision  which 
points  to  add  may  depend  on  a  number  of  criteria: 

•  In  which  alignment  does  the  potential  alignment 
point  exist?  Foreign-English  or  English-foreign? 

•  Does  the  potential  point  neighbor  already  estab¬ 
lished  points? 

•  Does  “neighboring”  mean  directly  adjacent  (block- 
distance),  or  also  diagonally  adjacent? 

•  Is  the  English  or  the  foreign  word  that  the  poten¬ 
tial  point  connects  unaligned  so  far?  Are  both  un¬ 
aligned? 

•  What  is  the  lexical  probability  for  the  potential 
point? 

The  base  heuristic  [Och  et  ah,  1999]  proceeds  as  fol¬ 
lows:  We  start  with  intersection  of  the  two  word  align¬ 
ments.  We  only  add  new  alignment  points  that  exist  in 
the  union  of  two  word  alignments.  We  also  always  re¬ 
quire  that  a  new  alignment  point  connects  at  least  one 
previously  unaligned  word. 

First,  we  expand  to  only  directly  adjacent  alignment 
points.  We  check  for  potential  points  starting  from  the  top 
right  corner  of  the  alignment  matrix,  checking  for  align¬ 
ment  points  for  the  first  English  word,  then  continue  with 
alignment  points  for  the  second  English  word,  and  so  on. 
This  is  done  iteratively  until  no  alignment  point  can  be 
added  anymore.  In  a  final  step,  we  add  non-adjacent 
alignment  points,  with  otherwise  the  same  requirements. 
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Figure  5:  Different  heuristics  to  symmetrize  word  align¬ 
ments  from  bidirectional  Giza++  alignments 


Figure  5  shows  the  performance  of  this  heuristic  (base) 
compared  against  the  two  mono-directional  alignments 
(e2f,  f2e)  and  their  union  (union).  The  figure  also  con¬ 
tains  two  modifications  of  the  base  heuristic:  In  the  first 
(diag)  we  also  permit  diagonal  neighborhood  in  the  itera¬ 
tive  expansion  stage.  In  a  variation  of  this  (diag-and),  we 
require  in  the  final  step  that  both  words  are  unaligned. 

The  ranking  of  these  different  methods  varies  for  dif¬ 
ferent  training  corpus  sizes.  For  instance,  the  alignment 
f2e  starts  out  second  to  worst  for  the  10,000  sentence  pair 
corpus,  but  ultimately  is  competitive  with  the  best  method 
at  320,000  sentence  pairs.  The  base  heuristic  is  initially 
the  best,  but  then  drops  off. 

The  discrepancy  between  the  best  and  the  worst 
method  is  quite  large,  about  0.02  BLEU.  For  almost 
all  training  corpus  sizes,  the  heuristic  diag-and  performs 
best,  albeit  not  always  significantly. 

4.6  Simpler  Underlying  Word-Based  Models 

The  initial  word  alignment  for  collecting  phrase  pairs 
is  generated  by  symmetrizing  IBM  Model  4  alignments. 
Model  4  is  computationally  expensive,  and  only  approxi¬ 
mate  solutions  exist  to  estimate  its  parameters.  The  IBM 
Models  1-3  are  faster  and  easier  to  implement.  For  IBM 
Model  1  and  2  word  alignments  can  be  computed  effi¬ 
ciently  without  relying  on  approximations.  For  more  in¬ 
formation  on  these  models,  please  refer  to  Brown  et  al. 
[1993].  Again,  we  use  the  heuristics  from  the  Section  4.5 
to  reconcile  the  mono-directional  alignments  obtained 
through  training  parameters  using  models  of  increasing 
complexity. 

How  much  is  performance  affected,  if  we  base  word 
alignments  on  these  simpler  methods?  As  Figure  6  indi- 
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Figure  6:  Using  simpler  IBM  models  for  word  alignment 
does  not  reduce  performance  much 


Language  Pair 

Model4 

Phrase 

Lex 

English-German 

0.2040 

0.2361 

0.2449 

French-English 

0.2787 

0.3294 

0.3389 

English-French 

0.2555 

0.3145 

0.3247 

Finnish-English 

0.2178 

0.2742 

0.2806 

Swedish-English 

0.3137 

0.3459 

0.3554 

Chinese-English 

0.1190 

0.1395 

0.1418 

Table  3:  Confirmation  of  our  findings  for  additional  lan¬ 
guage  pairs  (measured  with  BLEU) 


cates,  not  much.  While  Model  1  clearly  results  in  worse 
performance,  the  difference  is  less  striking  for  Model  2 
and  3.  Using  different  expansion  heuristics  during  sym¬ 
metrizing  the  word  alignments  has  a  bigger  effect. 

We  can  conclude  from  this,  that  high  quality  phrase 
alignments  can  be  learned  with  fairly  simple  means.  The 
simpler  and  faster  Model  2  provides  similar  performance 
to  the  complex  Model  4. 

4.7  Other  Language  Pairs 

We  validated  our  findings  for  additional  language  pairs. 
Table  3  displays  some  of  the  results.  For  all  language 
pairs  the  phrase  model  (based  on  word  alignments.  Sec¬ 
tion  3.1)  outperforms  IBM  Model  4.  Lexicalization  (Lex) 
always  helps  as  well. 

5  Conclusion 

We  created  a  framework  (translation  model  and  decoder) 
that  enables  us  to  evaluate  and  compare  various  phrase 
translation  methods.  Our  results  show  that  phrase  transla¬ 
tion  gives  better  performance  than  traditional  word-based 
methods.  We  obtain  the  best  results  even  with  small 


phrases  of  up  to  three  words.  Lexical  weighting  of  phrase 
translation  helps. 

Straight-forward  syntactic  models  that  map  con¬ 
stituents  into  constituents  fail  to  account  for  important 
phrase  alignments.  As  a  consequence,  straight-forward 
syntax-based  mappings  do  not  lead  to  better  translations 
than  unmotivated  phrase  mappings.  This  is  a  challenge 
for  syntactic  translation  models. 

It  matters  how  phrases  are  extracted.  The  results  sug¬ 
gest  that  choosing  the  right  alignment  heuristic  is  more 
important  than  which  model  is  used  to  create  the  initial 
word  alignments. 
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